Part 4. Regression (part b)
OLS estimator: \[(\hat{\beta_0}, \hat{\beta_1}, \ldots, \hat{\beta_K}) = \underset{(b_0, b_1, \ldots, b_K) \in \mathbb{R}^{K+1}}{\arg\min} \, \frac{1}{n}\sum_{i =1}^n\, \left(Y_i - (b_0 + b_1X_{[1]i} + \ldots + b_K X_{[K]i})\right)^2\]
In principle we could solve this by searching over combinations of \(b_0, b_1, \ldots, b_K\), but that is slow and imprecise.
Let’s try to understand a more efficient solution: \[\hat{\boldsymbol{\beta}} = \left( \mathbb{X}^T \mathbb{X} \right)^{-1} \mathbb{X}^T \mathbf{Y}\]
Start with the simplest case, a single predictor: \[(\hat{\beta_0}, \hat{\beta_1}) = \underset{(b_0, b_1) \in \mathbb{R}^{2}}{\arg\min} \, \frac{1}{n} \sum_{i =1}^n\, \left(Y_i - (b_0 + b_1X_i)\right)^2\]
Dropping the factor \(1/n\) (which does not affect the minimizer) and expanding the sum, which we call \(S\), we get
\[\begin{align} S &= \sum_{i = 1}^n \left(Y_i^2 - 2\beta_0 Y_i - 2\beta_1 X_iY_i + \beta_0^2 + 2\beta_0 \beta_1 X_i + \beta_1^2 X_i^2\right) \\ &= \sum Y_i^2 - 2 \beta_0 \sum Y_i - 2\beta_1 \sum X_i Y_i + n \beta_0^2 + 2\beta_0 \beta_1 \sum X_i + \beta_1^2 \sum X_i^2 \end{align}\]
We want to choose \(\beta_0\) and \(\beta_1\) to minimize \(S\).
How would you choose \(\beta_0\) to minimize \(S = -2\beta_0 + \beta_0^2\)?
Observing that the function is flat only at the minimum, we use differentiation to obtain the first order condition:
\[\frac{\partial S}{\partial \beta_0} = -2 + 2 \beta_0 = 0\]
So the solution is \(\beta_0 = 1\).
How would you choose \(\beta_0\) and \(\beta_1\) to minimize \(S = -2\beta_0 + \beta_0^2 - 3 \beta_1 + \beta_1^2 + \beta_0 \beta_1\)?
Taking the same approach, the first order conditions are
\[\begin{align} \frac{\partial S}{\partial \beta_0} &= -2 + 2 \beta_0 + \beta_1 = 0 \\ \frac{\partial S}{\partial \beta_1} &= -3 + \beta_0 + 2\beta_1 = 0 \end{align}\]
A system of equations!
\[\begin{align} 2 \beta_0 + \beta_1 = 2 \\ \beta_0 + 2\beta_1 = 3 \end{align}\]
By isolating and substituting, we get \(\beta_0 = 1/3\) and \(\beta_1 = 4/3\).
To find \(\beta_0\) and \(\beta_1\) that minimize \[S = \sum Y_i^2 - 2 \beta_0 \sum Y_i - 2\beta_1 \sum X_i Y_i + n \beta_0^2 + 2\beta_0 \beta_1 \sum X_i + \beta_1^2 \sum X_i^2,\] we use the same approach.
\[\begin{align} \frac{\partial S}{\partial \beta_0} &= -2 \sum Y_i + 2 n \beta_0 + 2\beta_1 \sum X_i &= 0\\ \frac{\partial S}{\partial \beta_1} &= -2 \sum X_i Y_i + 2 \beta_0 \sum X_i + 2 \beta_1 \sum X_i^2 &= 0 \end{align}\]
System of equations:
\[\begin{align} n \beta_0 + \beta_1 \sum X_i &= \sum Y_i \\ \beta_0 \sum X_i + \beta_1 \sum X_i^2 &= \sum X_i Y_i \end{align}\]
Isolating \(\beta_0\) in the first equation, we get \[ \beta_0 = \frac{\sum Y_i}{n} - \beta_1 \frac{\sum X_i}{n}\]
Substituting into second equation, rearranging, dividing top & bottom by \(n\), we get \[\beta_1 = \frac{\frac{\sum X_iY_i}{n} - \frac{\sum X_i}{n}\frac{\sum Y_i}{n}}{\frac{\sum X_i^2}{n} - \left(\frac{\sum X_i}{n}\right)^2}\]
These are the sample analogues of the bivariate BLP (where \(\beta_1 = \frac{\text{Cov}[X,Y]}{{\textrm V}\,[X]}\)).
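As a quick numerical illustration (a sketch with made-up data; note that R's cov() and var() use the \(n-1\) denominator, which cancels in the ratio):

x <- c(1, 2, 3, 4, 5)
y <- c(2, 1, 4, 3, 6)
b1 <- (mean(x * y) - mean(x) * mean(y)) / (mean(x^2) - mean(x)^2)
b0 <- mean(y) - b1 * mean(x)
c(b0, b1)             # 0.2 and 1.0
cov(x, y) / var(x)    # same slope via Cov[X,Y] / V[X]
coef(lm(y ~ x))       # same intercept and slope from lm()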
We have shown how, with a bit of calculus, we can derive the OLS estimator in the case with one predictor.
If we want \(K > 1\) predictors, we could do the same thing, but the algebra quickly becomes a horrible mess.
To make things easier, we turn to some linear algebra.
A matrix \(\mathbb{B}\) is a rectangular table of numbers, e.g.
\[ \mathbb{B} = \begin{bmatrix} 1 & 2 \\ 4 & 0 \\ -1 & 3 \end{bmatrix}\]
This is a \(3\times 2\) matrix (rows \(\times\) columns).
The transpose of \(\mathbb{B}\), written \(\mathbb{B}^T\) or \(\mathbb{B}'\), is
\[ \mathbb{B}^T = \begin{bmatrix} 1 & 4 & -1\\ 2 & 0 & 3 \end{bmatrix}.\]
\(k\)th row of \(\mathbb{B}\) is \(k\)th column of \(\mathbb{B}^T\) (and vice versa).
When we multiply a matrix \(\mathbb{B}\) with \(K\) columns by a column vector \(\mathbf{D}\) of length \(K\), we get a column vector whose length equals the number of rows of \(\mathbb{B}\):
# equivalent ways to make B:
B <- cbind(c(1, 4, -1), c(2, 0, 3))
(B <- matrix(data = c(1, 4, -1, 2, 0, 3), ncol = 2))
     [,1] [,2]
[1,]    1    2
[2,]    4    0
[3,]   -1    3
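As a quick check, now that B is in R we can get its transpose with t():

t(B)
     [,1] [,2] [,3]
[1,]    1    4   -1
[2,]    2    0    3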
Can only multiply \(\mathbb{B}\) and \(\mathbf{D}\) if number of columns in \(\mathbb{B}\) equals number of rows in \(\mathbf{D}\).
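For example (a small sketch; the vector D below is made up for illustration), multiplying the \(3 \times 2\) matrix B above by a length-2 vector gives a length-3 result:

D <- c(5, -1)
B %*% D
     [,1]
[1,]    3
[2,]   20
[3,]   -8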
Recall our system of equations:
\[\begin{align} n \beta_0 + \beta_1 \sum X_i &= \sum Y_i \\ \beta_0 \sum X_i + \beta_1 \sum X_i^2 &= \sum X_i Y_i \end{align}\]
We can write this in matrix form as
\[ \mathbb{B} \boldsymbol{\beta} = \mathbf{D} \]
where
\[\mathbb{B} = \begin{bmatrix} n & \sum X_i \\ \sum X_i & \sum X_i^2 \end{bmatrix}, \, \, \, \,\, \boldsymbol{\beta} = \begin{bmatrix} \beta_0 \\ \beta_1 \end{bmatrix}, \, \, \, \,\, \mathbf{D} = \begin{bmatrix} \sum Y_i \\ \sum X_i Y_i \end{bmatrix}\]
The inverse of a scalar (number) \(a\) is written \(a^{-1}\), and \(a \times a^{-1} = 1\). (e.g. \(3^{-1} = 1/3\), so \(3 \times 1/3 = 1\).)
We multiply both sides by the inverse of \(a\) when we do
\[\begin{align} a x &= b \\ x &= b/a \end{align}\]
The inverse of a (square) matrix \(\mathbb{B}\) is written \(\mathbb{B}^{-1}\), and \(\mathbb{B}^{-1}\mathbb{B} = \mathbb{I}\), the identity matrix (the matrix analogue of 1).
We can similarly solve our problem using the inverse of \(\mathbb{B}\):
\[\begin{align} \mathbb{B} \boldsymbol{\beta} &= \mathbf{D} \\ \boldsymbol{\beta} &= \mathbb{B}^{-1} \mathbf{D} \end{align}\]
This is a useful way to represent the solution to a system of equations.
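For example, the system we solved earlier by substitution can be solved with the standard \(2 \times 2\) inverse formula:
\[\begin{bmatrix} 2 & 1 \\ 1 & 2 \end{bmatrix} \begin{bmatrix} \beta_0 \\ \beta_1 \end{bmatrix} = \begin{bmatrix} 2 \\ 3 \end{bmatrix} \quad\Longrightarrow\quad \begin{bmatrix} \beta_0 \\ \beta_1 \end{bmatrix} = \frac{1}{3}\begin{bmatrix} 2 & -1 \\ -1 & 2 \end{bmatrix} \begin{bmatrix} 2 \\ 3 \end{bmatrix} = \begin{bmatrix} 1/3 \\ 4/3 \end{bmatrix},\]
matching \(\beta_0 = 1/3\) and \(\beta_1 = 4/3\) from before.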
There are manual procedures for inverting matrices. We use R’s solve() function:
# creating our matrices for one toy dataset
x <- c(1, 2, 3)
y <- c(1, 3, 3)
(B <- cbind(c(length(x), sum(x)), c(sum(x), sum(x^2))))
     [,1] [,2]
[1,]    3    6
[2,]    6   14
(D <- c(sum(y), sum(x * y)))
[1]  7 16
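Putting it together (a short sketch continuing the toy data above), solve(B, D) computes \(\mathbb{B}^{-1}\mathbf{D}\) directly:

solve(B, D)
[1] 0.3333333 1.0000000

So for this toy dataset, \(\beta_0 = 1/3\) and \(\beta_1 = 1\).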
So we have (1) a way to represent systems of equations and (2) a way to solve them.
Let \(\mathbb{X}\) be the \(n \times (K + 1)\) regressor matrix (a column of 1s and one column per predictor):
\[\mathbb{X} = \begin{bmatrix} 1 & X_{[1]1} & X_{[2]1} & \cdots & X_{[K]1} \\ 1 & X_{[1]2} & X_{[2]2} & \cdots& X_{[K]2} \\ \vdots & \vdots & \vdots & \ddots & \vdots \\ 1 & X_{[1]n} & X_{[2]n} & \cdots & X_{[K]n} \end{bmatrix}\]
Let \(\boldsymbol{\beta}\) be the \(K+1\)-length column vector of coefficients: \[\boldsymbol{\beta} = \begin{bmatrix} \beta_0 \\ \beta_1 \\ \vdots \\ \beta_{K} \end{bmatrix}\]
Let \(\mathbf{Y}\) be the \(n\)-length column vector of outcomes: \[\mathbf{Y} = \begin{bmatrix} Y_1 \\ Y_2 \\ \vdots \\ Y_{n} \end{bmatrix}\]
For a given coefficient vector \(\boldsymbol{\beta}\), the fitted values are \(\mathbb{X}\boldsymbol{\beta}\), so the residual vector is \(\mathbf{Y} - \mathbb{X}\boldsymbol{\beta}\) and the sum of squared residuals is \((\mathbf{Y} - \mathbb{X}\boldsymbol{\beta})^T (\mathbf{Y} - \mathbb{X}\boldsymbol{\beta})\).
OLS estimator in matrix form:
\[(\hat{\beta_0}, \hat{\beta_1}, \ldots, \hat{\beta_K}) = \underset{\mathbf{b} \in \mathbb{R}^{K+1}}{\arg\min} \, (\mathbf{Y} - \mathbb{X}\mathbf{b})^T (\mathbf{Y} - \mathbb{X}\mathbf{b}), \]
where \(\mathbf{b} = (b_0, b_1, \ldots, b_K)^T\).
Now we do minimization via differentiation for this version of the problem.
Expand the sum:
\[ (\mathbf{Y} - \mathbb{X}\boldsymbol{\beta})^T (\mathbf{Y} - \mathbb{X}\boldsymbol{\beta}) = \mathbf{Y}^T \mathbf{Y} - \mathbf{Y}^T \mathbb{X} \boldsymbol{\beta} - \boldsymbol{\beta}^T \mathbb{X}^T \mathbf{Y} + \boldsymbol{\beta}^T\mathbb{X}^T \mathbb{X} \boldsymbol{\beta}\]
Differentiating with respect to \(\boldsymbol{\beta}\) gives us the first order condition(s), \[2 \mathbb{X}^T \mathbb{X}\boldsymbol{\beta} - 2 \mathbb{X}^T \mathbf{Y} = 0,\]
which becomes the system of equations \[\mathbb{X}^T \mathbb{X}\boldsymbol{\beta} = \mathbb{X}^T \mathbf{Y} \]
with solution \[ \boldsymbol{\beta} = \left(\mathbb{X}^T \mathbb{X}\right)^{-1} \mathbb{X}^T \mathbf{Y},\] which is exactly the OLS formula \(\hat{\boldsymbol{\beta}}\) we started with.
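As a numerical check (a minimal sketch with simulated data; the variable names and coefficients below are made up for illustration), we can compute this formula directly and compare it with R's lm():

set.seed(1)
n <- 100
x1 <- rnorm(n)
x2 <- rnorm(n)
y <- 1 + 2 * x1 - 0.5 * x2 + rnorm(n)
X <- cbind(1, x1, x2)               # regressor matrix: a column of 1s, then the predictors
solve(t(X) %*% X) %*% t(X) %*% y    # (X'X)^{-1} X'Y
coef(lm(y ~ x1 + x2))               # same coefficients from lm()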
Our system of equations was:
\[\begin{align} n \beta_0 + \beta_1 \sum X_i &= \sum Y_i \\ \beta_0 \sum X_i + \beta_1 \sum X_i^2 &= \sum X_i Y_i \end{align}\]
In matrix form:
\[\begin{bmatrix} n & \sum X_i \\ \sum X_i & \sum X_i^2 \end{bmatrix} \begin{bmatrix} \beta_0 \\ \beta_1 \end{bmatrix} = \begin{bmatrix} \sum Y_i \\ \sum X_i Y_i \end{bmatrix}\]
This is \(\mathbb{X}^T \mathbb{X} \boldsymbol{\beta} = \mathbb{X}^T \mathbf{Y}\):
\[\begin{align}\mathbb{X}^T \mathbb{X} &= \begin{bmatrix} 1 & 1 & \cdots & 1 \\ X_1 & X_2 & \cdots & X_n \end{bmatrix} \begin{bmatrix} 1 & X_1 \\ 1 & X_2 \\ \vdots & \vdots \\ 1 & X_n \end{bmatrix}\\ &= \begin{bmatrix} n & \sum X_i \\ \sum X_i & \sum X_i^2\end{bmatrix} \end{align}\]
\[\begin{align}\mathbb{X}^T \mathbf{Y} &= \begin{bmatrix} 1 & 1 & \cdots & 1 \\ X_1 & X_2 & \cdots & X_n \end{bmatrix} \begin{bmatrix} Y_1 \\ Y_2 \\ \vdots \\ Y_n \end{bmatrix} \\ &= \begin{bmatrix} \sum Y_i \\ \sum X_i Y_i \end{bmatrix} \end{align}\]
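We can confirm this numerically with the toy dataset from before (a quick sketch reusing the x, y, B, and D built above):

X <- cbind(1, x)                  # x <- c(1, 2, 3)
t(X) %*% X                        # same matrix as B: rows (3, 6) and (6, 14)
t(X) %*% y                        # same vector as D: (7, 16)
solve(t(X) %*% X, t(X) %*% y)     # OLS coefficients: 1/3 and 1, as before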
The most important things to know about OLS:
Today we have added some intuition about how OLS is estimated:
For a walk-through of the optimization part of matrix OLS, see Ben Lambert’s econometrics videos:
Also see the video on the variance of the OLS estimator given homoskedasticity.